# +~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~ #  
#
#' @title  Evaluate the content validity of our XLM-T elite criticism classifier
#' @author Hauke Licht
#
# +~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~ #

# setup ----

# load packages
library(readr)
library(dplyr)
library(tidyr)
library(purrr)
library(future)
n_workers <- sum(future::availableCores())
plan(multisession, workers = n_workers)
library(furrr)
library(quanteda) # v3.3.1
options("quanteda_verbose" = FALSE)
options("quanteda_threads" = n_workers)
library(ggplot2)

# data paths 

base_path <- file.path(".")
data_path <- file.path(base_path, "data")
input_path <- file.path(data_path, "input")
output_path <- file.path(data_path, "output")

# helpers
helpers_path <- file.path(base_path, "code", "helpers")
source(file.path(helpers_path, "text_preproc.R")) # for `tokenize.corpus()`
source(file.path(helpers_path, "figthin_words.R")) # for `textstat_fighting_words()`

# custom helper

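#' compute fighting words 
#' 
#' @param x a data frame with columns 'doc_id' and 'text'
#' @param .lang unit-length character vector, specifying language
#' @param grp.var unit-length character vector naming the grouping variable
#'   (passed to `textstat_fighting_words.dfm()` as `group.var`)
#' @param .k integer specifying top-*k* terms (by absolute z-score) to extract by group
#' @param ... further arguments passed to `tokenize.corpus()`
#' @param .verbose logical, whether to print progress messages
#' 
#' @returns a list with elements 
#'   \describe{
#'     \item{fit}{(list) fighting words estimates}
#'     \item{topk}{(list of data frames) top-k terms by group and comparison}
#'   }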
compute_fighting_words <- function(x, .lang, grp.var, .k = 10L, ..., .verbose = FALSE) {
  if (.verbose)
    message("processing data for language ", dQuote(.lang))
  
  dat <- x %>%
    corpus(docid_field = "doc_id", text_field = "text") %>%
    tokenize.corpus(
      lang = .lang
      , ...
      # use ISO stop words to ensure large language coverage
      , stopwords = stopwords::stopwords(.lang, source = "stopwords-iso")
      , .verbose = .verbose
    ) %>% 
    dfm(tolower = FALSE, verbose = .verbose) 
  
  # use empirical Bayes approach (i.e., get prior pseudo-counts from data)
  fw <- textstat_fighting_words.dfm(dat, group.var = grp.var)
  
  out <- list(fit = fw)
  
  # keep the .k features with the largest absolute z-scores in each group
  out$topk <- map(fw, function(est) {
    est %>% 
      group_by(across(all_of(grp.var))) %>% 
      slice_max(abs(z_score), n = .k) %>% 
      ungroup() %>%
      arrange(z_score)
  })
  
  return(out)
} 
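
# Illustration (not used below): a minimal sketch of the "fighting words"
# z-score of Monroe, Colaresi, and Quinn (2008), i.e., the log-odds ratio with
# an informative Dirichlet prior. It assumes that the sourced helper
# `textstat_fighting_words.dfm()` follows this approach; the helper's actual
# implementation may differ. `fw_sketch()`, its arguments, and the default
# prior size `a0` are hypothetical.
fw_sketch <- function(y1, y2, a0 = 10) {
  # y1, y2: named numeric vectors with feature counts in groups 1 and 2
  stopifnot(identical(names(y1), names(y2)))
  # prior pseudo-counts proportional to the pooled feature frequencies
  alpha <- a0 * (y1 + y2) / sum(y1 + y2)
  n1 <- sum(y1)
  n2 <- sum(y2)
  # difference in smoothed log-odds of each feature between the groups
  delta <- log((y1 + alpha) / (n1 + a0 - y1 - alpha)) -
    log((y2 + alpha) / (n2 + a0 - y2 - alpha))
  # approximate variance of that difference
  v <- 1 / (y1 + alpha) + 1 / (y2 + alpha)
  # z-score: positive values mark features over-represented in group 1
  delta / sqrt(v)
}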

# load data ----

all_tweets_labeled <- read_rds(file.path(output_path, "parl_party_tweets_labeled.rds"))

# obtain fighting words ----

# note: focus on tweets in English or German (any country) and on tweets from Spain,
#  keeping only country-language pairs with at least 1,000 political tweets

country_langs <- all_tweets_labeled %>% 
  filter(political == "yes") %>% 
  filter(
    lang %in% c("de", "en")
    | 
    country_iso3c == "ESP"
  ) %>% 
  count(country_iso3c, lang) %>% 
  filter(n >= 1000)

# prepare texts: build document IDs and mask URLs and e-mail addresses
tmp <- all_tweets_labeled %>% 
  filter(political == "yes") %>% 
  inner_join(select(country_langs, -n), by = c("country_iso3c", "lang")) %>% 
  mutate(doc_id = sprintf("%s_%s_%s_%s", country_iso3c, party_name_short, user_id, status_id)) %>% 
  distinct(country_iso3c, doc_id, text, lang, prob_elitecriticism, elitecriticism) %>% 
  mutate(
    text = stringr::str_replace_all(
      text
      , c(
        "https?://\\S+" = "[URL]"
        , "www\\.\\S+" = "[URL]"
        , "[^@ ]+\\.com" = "[URL]"
        , "[^@ ]+\\.co\\.\\S+" = "[URL]"
        , "\\S+@\\S+" = "[EMAIL]"
      )
    )
  )
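
# Illustration (not used below): `stringr::str_replace_all()` applies a named
# pattern vector in order, so URLs are masked before e-mail addresses. The toy
# string and the object name `masking_demo` are made up for this sketch.
masking_demo <- stringr::str_replace_all(
  "read https://example.org/a and write to jane@example.org"
  , c("https?://\\S+" = "[URL]", "\\S+@\\S+" = "[EMAIL]")
)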

fw_dat <- group_split(tmp, country_iso3c, lang)
names(fw_dat) <- map_chr(fw_dat, ~paste(.[1, c("country_iso3c", "lang")], collapse = "-"))

# estimate fighting words for each country-language subset (in parallel)
fighting_words <- future_map2(
  fw_dat
  , sub("^[A-Z]{3}-", "", names(fw_dat))
  , compute_fighting_words
  , grp.var = "elitecriticism"
  , .k = 50L
  , ngrams = 1:2
  , stem = FALSE
  , tolower = FALSE
  , .verbose = FALSE
  , .progress = TRUE
  , .options = furrr_options(seed = 1234L, packages = c("quanteda", "Matrix"))
)

names(fighting_words) <- names(fw_dat)

# summarize the top-k terms (first element of `topk`) for each country-language subset
for (k in names(fighting_words)) {
  out <- list()
  
  out$topk <- fighting_words[[k]]$topk[[1]] %>% 
    arrange(elitecriticism, desc(abs(z_score))) %>% 
    select(group, elitecriticism, feature, z_score)
  
  out$topk_strings <- out$topk %>% 
    group_by(elitecriticism) %>%
    summarise(
      topk = paste(sprintf("`%s' (%-.02f)", feature, z_score), collapse = ", ")
      , .groups = "keep"
    )
  
  fighting_words[[k]]$summarized <- out
}

# write to disk ----

fp <- file.path(output_path, "validation", "fighting_words_xmlt_classifier.rds")
# note: an existing file is not overwritten; delete it first to refresh the results
if (!file.exists(fp)) {
  dir.create(file.path(output_path, "validation"), showWarnings = FALSE)
  write_rds(fighting_words, fp)
}
